Engineering posts about Resilience Engineering

Curated summaries and key learnings for engineers working with Resilience Engineering.

Cloudflare
11m

Code Orange: Fail Small is complete. The result is a stronger Cloudflare network

The article outlines the completion of Cloudflare's 'Code Orange: Fail Small' initiative, aimed at enhancing the resilience and reliability of its network infrastructure. Key improvements include the...

DigitalOcean
13m

From Incident Counting to SLIs: How DigitalOcean Rethought Availability

The article discusses DigitalOcean's transition from an incident-counting methodology to a more nuanced SLI-based approach for measuring availability. Initially, the company relied on a simplistic...

Cloudflare
8m

A one-line Kubernetes fix that saved 600 hours a year

The article discusses a critical performance issue encountered when running Atlantis, a tool for managing Terraform changes, on Kubernetes. The problem stemmed from slow restarts due to a default behavior...

Meta (Facebook)
5m

Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters

The article discusses the implementation of backend aggregation (BAG) in Meta's Prometheus AI clusters, highlighting its role in interconnecting thousands of GPUs across multiple data centers. BAG...

GitHub
6m

When protections outlive their purpose: A lesson on managing defense systems at scale

The article outlines the challenges faced by GitHub in managing defense mechanisms that protect the platform from abuse while ensuring legitimate users are not adversely affected. It highlights the...

Cloudflare
11m

Code Orange: Fail Small — Our resilience plan following recent incidents

The article outlines Cloudflare's 'Code Orange: Fail Small' initiative aimed at enhancing the resilience of its network following significant outages. It details the incidents that led to the plan,...

Databricks
3m

Welcoming Stately Cloud to Databricks: Investing in the Foundation for Scalable AI Applications

The article highlights Databricks' acquisition of Stately Cloud, emphasizing the importance of building a robust foundation for scalable AI applications. It discusses the expertise of the Stately...

Atlassian
15m

Pull request intervention for infrastructure-as-code risks with Bitbucket custom merge checks

The article discusses Atlassian's approach to mitigating infrastructure-as-code risks using Bitbucket custom merge checks. It highlights the importance of...